Characteristics, Impact, and Tolerance of Partial Disk Failures
Abstract
Hard-disk failures are one of the primary causes of data loss in both enterprise storage systems and personal computers. Most disk failures are partial failures, where only some sectors are unavailable due to a latent sector error or some blocks are silently corrupted. This dissertation focuses on all aspects of such partial disk failures – their characteristics, their impact on different systems, and techniques that can be used to tolerate them. We perform the first large-scale study of partial disk failures, involving 1.53 million disks in more than 50,000 storage systems. We find that partial disk failures affect a large percentage of disks (e.g., in the worst case, latent sector errors affect up to 20% of the disks in 2 years). We also find that (i) inexpensive SATA drives have a higher probability of developing partial disk failures, (ii) failures are not independent; failures within the same disk have high spatial and temporal locality, and (iii) many failures are detected by background scans of disk blocks called “disk scrubbing.” We examine the impact of partial disk failures on a variety of systems. We use model checking to examine data protection in RAID systems. We find that schemes in many RAID systems are broken; they do not protect against one or more failures, leading to unrecoverable data loss or corrupt data being returned to applications. We apply type-aware fault injection to examine the impact of partial disk failures on the virtual-memory systems of Linux, FreeBSD, and Windows XP. We find that these systems use simplistic or inconsistent failure-handling policies, thus causing data corruption and system-security violations. We analyze the impact of corrupt on-disk pointers on two file systems, NTFS and ext3. We find that these systems do not use available fault-tolerance techniques effectively, resulting in data loss and non-mountable file systems. Overall, we find that a single system cannot be depended upon to reliably store data.
Similar resources
Adaptive Checkpointing
Checkpointing is a typical approach to tolerate failures in today’s supercomputing clusters and computational grids. Checkpoint data can be saved either in central stable storage, or in processor memory (as in diskless checkpointing), or local disk space (replacing memory with local disk in diskless checkpointing). But where to save the checkpoint data has a great impact on the performance of a...
Modeling and Performance Comparison of Reliability Strategies for Distributed Video Servers
Large scale video servers are typically based on disk arrays that comprise multiple nodes and many hard disks. Due to the large number of components, disk arrays are susceptible to disk and node failures that can affect the server reliability. Therefore, fault-tolerance must be already addressed in the design of the video server. For fault-tolerance, we consider parity-based as well as mirrorin...
High-fidelity reliability simulation of XOR-based erasure codes
Erasure codes are the means by which storage systems are typically made reliable. Recent high profile studies of disk failure and sector failures indicate that ever more fault tolerant erasure codes are needed. Many traditional RAID approaches, parity-check array codes (e.g., EVENODD, RDP, and X-code), and MDS codes offer two and three disk fault tolerant schemes. There are also many novel erasu...
In Search of I/O-Optimal Recovery from Disk Failures
We address the problem of minimizing the I/O needed to recover from disk failures in erasure-coded storage systems. The principal result is an algorithm that finds the optimal I/O recovery from an arbitrary number of disk failures for any XOR-based erasure code. We also describe a family of codes with high-fault tolerance and low recovery I/O, e.g. one instance tolerates up to 11 failures and r...
Disk Array Storage System
Fault tolerance requirements for near term disk array storage systems are analyzed. The excellent reliability provided by RAID Level 5 data organization is seen to be insufficient for these systems. We consider various alternatives (improved MTBF and MTTR times as well as smaller reliability groups and increased numbers of check disks per group) to obtain the necessary improved reliability. The...
Publication date: 2007